5  Data management using spreadsheets

Go ahead and download the research compendium for this module using this link, make sure the directory is unzipped and move it to your

Learning Objectives

After completing this lab you should be able to

  • 1 we will find that dates are the worst in terms of causing chaos… are they numbers? strings? categorical or continuous data?

  • 5.1 Wait… spreadsheets?

    The foundation of any research project is good data organization. This not only includes your actual data points (observations) but also things like keeping track of specimen and other samples along with all the meta-data2. Additionally, you should keep good records of how the data was produced (your methods). Thinking through ahead of time what measurements are important, i.e. what data you will record and how you will store your data is really important to make sure you are keeping track of the entire process. Good data management and clean data sets will make sharing and analyzing data a lot more straightforward.

  • 2 Metadata is data about your data. It helps describe, categorize, organize, and provide context to the main content or primary data it is associated with. It’s commonly used to provide additional information that helps users, systems, or processes understand and interpret the main data.

  • While we are going to use R to wrangle, manipulate, analyze, and visualize data in R for a more efficient and reproducible approach compared to what can be using spreadsheets, spreadsheets are the better tool for data entry, data management/organization, and simple quality control (QC) and quality assurance (QA). Thinking ahead to how you want to organize and format your data in spreadsheets will prevent a lot of extra work and headaches down the line, especially when we plan ahead as to how we should organize it to make it more efficient to use with command-line computational tools such as R.

    So, before we dive deep into manipulating data with R, we’ll take a small step back and think through a few fundamental rules for managing data in spreadsheets before learning how to do these and more advanced data wrangling and manipulation using R.

    While you can do some statistics and plotting using spreadsheet programs we will not be learning how to do this in this class. Data analysis in spreadsheets requires a lot of manual work and generally any time you want to change one parameter or if you have to update your spreadsheet with new entries or you need to apply the same analysis to another data set you end up having to redo everything by hand. The more things you do by hand, the more likely you are to make a mistake. Even if you do apply some sophisticated coding in spreadsheets and/or use it for analysis or plotting3 it is very difficult to track the exact steps or document them in a way that makes it fully reproducible for another person.

  • 3 It can be helpful to know the fundamentals for simple plots in spreadsheets for quick and dirty plotting to get a quick look at your data to get an idea of what it looks like and spot potential mistakes during data entry without having to export data and fire up R or a another command-line program.

  • 5.2 Best practices for formatting data in spreadsheets

    5.2.1 Format your data set for the tool you will use to analyze it with

    Our brains don’t work the same way as computers. Your spreadsheet is not a lab notebook and while a layout where there are notes in the margin, context of the experiment, or a specific layout of data might be something that you can interpret, it will more difficult for another person to forllow your through process and understand your records/notes. Another person might have the opportunity to ask follow up questions and get the clarifications they need, but a computer cannot.

    Occasionally, you might use a spreadsheet instead of a lab notebook where it is a way of keeping track of an experiment and various people interacting with samples, completing different steps etc. However, if you are using a spreadsheet for data entry and management, then you want to ensure that you have set up your spreadsheet in a way where a computer is going to be able to correctly interpret it as intended. This means that we need to think through how we want to set up spreadsheets. There are generally a few different ways you can set things up and some of them will limit how easy it is for you and/or a future collaborator to work with it down the line4.

  • 4 the optimum software/interface for data input and layout/formatting may differ depending on your intended analysis, so keep in mind how you want to be able to analyze your data and whether that will require specific formats. In general, try to pick a format that will give you the advantage of being able to easily convert it between different formats - which is something we will learn to do with R specifically in the tidyverse which centers on a specific concept of what makes data tidy.

  • 5.2.2 Never touch your raw data

    Raw data is the original, unaltered data that is collected or generated before any manipulation, transformation, or analysis takes place. It’s the most fundamental form of data, directly obtained from observations, measurements, or data sources. Raw data is often in its most unstructured and basic state, and it serves as the foundation for subsequent data processing and analysis.

    In the biological and environmental sciences typical sources include direct observations made in experiments, field studies, or natural phenomena or measurments from sensors or other instruments measuring physical and chemical parameters such as temperature, GPS coordinate, pH etc.

    Be mindful

    Never touch your raw data. Always keep a copy of your raw data that you never modify directly.

    For any kind of data related work it is important that you preserve the original, unaltered version of your data when conducting data analysis. Avoid making changes directly to the original data files or data set. Instead, You should work with copies of the data or use a structured workflow that ensures the integrity and reproducibility of your analysis.

    Keeping your raw data as a separate file that is never altered is important for

    • Data Integrity: Modifying the raw data directly could lead to unintended changes or loss of information. By keeping the raw data untouched, you ensure that you have a reliable source of truth to refer back to if needed.
    • Reproducibility: If you or others need to replicate your analysis in the future, having access to the exact original data is crucial. Changes made to the raw data could make it difficult or impossible to reproduce your results accurately.
    • Error Prevention: Working with copies of the raw data minimizes the risk of accidental changes or mistakes that could affect your analysis. If errors occur, you can always go back to the untouched raw data.
    • Data Auditing: In some cases, you might need to show the authenticity and accuracy of your data. Keeping the original data untouched allows you to demonstrate the reliability of your work.
    • Multiple Analyses: If you’re working on different analyses, projects, or collaborations using the same data, maintaining the integrity of the raw data enables consistency across these efforts.

    Best practices of maintaining the integrity of your raw data include * making copies: Always work with copies of the original data files or datasets. This ensures that any changes you make are isolated from the raw data. * implementing a structured workflow: Develop a structured workflow that includes data cleaning, transformation, and analysis steps. Document each step thoroughly to ensure transparency and repeatability. * using version vontrol: Use version control systems like Git to track changes to your code and analysis scripts. This allows you to see how your analysis evolves over time. * creating backups: Regularly back up your data, including both the raw data and any processed versions, to prevent data loss. * creating documentation: Maintain clear and detailed documentation about the steps you’ve taken, the rationale behind your decisions, and any changes you’ve made to the data.

    5.2.3 Keep track of your formatting steps

    By working with copies and following a structured workflow, you can ensure the accuracy, reproducibility, and integrity of your data analysis work. While you shouldn’t modify the raw data directly, it’s also important to apply necessary data preprocessing steps (like cleaning and transforming) as part of your analysis process. This means that you should do two things

    1. Any time you need to process or analyze your data make a copy instead of operating directly in your raw data5 and then create a new file with your cleaned or analyzed data.
    2. Keep track of the exact steps you took to clean or analyze your data6; this is just as important as keeping a detailed record of the steps you took in an experiment. Good practice would be to keep a plain text file or similar in the same folder as your data set where you record any steps you take.
  • 5 In our next lesson you are going to see that this is a key advantage of command line programs like R where you can read a raw data set into the program and then apply specific data wrangling, manipulation and analysis steps without altering the raw data.

  • 6 The second advantage of command-line programs like R is that because you are using a series of commands to wrangle and analyze your data your are automatically creating a very detailed, reproducible record of your your workflow

  • 5.2.4 Put variables in columns and observations in rows

    Observations and variables are two fundamental concepts that describe different aspects of data.

    Consider this

    Compare and contrast what an observation is compared to a variable.

    An observation is a single data point or unit within a data set, while a variable is a characteristic that is being measured or observed for each of those data points.

    Together, observations and variables make up the structure of a data set, where each observation has values for the various variables being measured.

    Here is the key rule for structuring data in spreadsheet:

    Every variable gets its own column, every observation gets its own rule, do not combine multiple pieces of information in one cell7.

  • 7 This is what we will refer to as tidy data. We will explore this concept more in depth down the line and will play “Is it tidy?” regularly this semester.

  • 5.2.5 Export data as text-based formats

    While it is a lot easier to enter and look at data in a spreadsheet you should always export your raw and cleaned data set as a text-based format such as CSV, TSV, JSON, XML, etc.).

    This offers several advantages compared to proprietary binary formats8:

  • 8 an example would be Excel’s .xlsx

    • Interoperability: Text-based formats are widely supported by various software and programming languages. This means that data can be easily shared and integrated into different applications and systems, regardless of the software being used. Especially if program have proprietary formats having a format that is platform independent is really important.
    • Simplicity: Text-based formats have a simple, human-readable structure. This makes it easier to understand the data’s content, and it allows manual inspection and editing using basic text editors.
    • Version Control: Text-based formats work well with version control systems (e.g., Git). Since changes can be easily tracked in plain text, it’s easier to collaborate, review, and manage changes made to the data. & Data Integrity: Text-based formats are less prone to corruption and data loss. Proprietary binary formats can sometimes become corrupted, making data recovery difficult.
    • Platform Independence: Text-based formats are platform-independent. They can be used on different operating systems without compatibility issues
    • Reduced File Size: Text-based formats generally have smaller file sizes compared to their binary counterparts. This can be advantageous for sharing and storage, especially when dealing with large datasets.
    • Automation and Scripting: Text-based formats are well-suited for automation and scripting tasks. Many programming languages have libraries and tools to read and write data from these formats easily.
    • Data Analysis: Text-based formats can be directly used in data analysis workflows. Data scientists and analysts often use tools like Python, R, and SQL to work with text-based data formats.
    • Data Accessibility: When sharing data with others, especially outside your organization, text-based formats offer a universal way to provide data that can be imported into various tools without compatibility issues.

    5.3 Common spreadsheet formatting issues

    Consider this

    Briefly describe common formatting mistakes formatting data in spreadsheets, explain why it might be tempting to format data in this way and why it might cause downstream issues for data analysis.

    Here are the key points you will want to discuss:

    • multiple tables in one tab
    • data spread across multiple tabs
    • not filling in zeros
    • using problematic null values for missing data
    • using formatting to convey information
    • using formatting to make the data sheet look pretty
    • placing comments or units in cells
    • entering more than one piece of information per cell
    • using problematic field/column names
    • using special characters in data
    • including metadata in the data table
    • date formatting

    5.4 Dates are data (but like, the worst kind)

    Probably the most intuitive way to store dates in a spreadsheet would be to create a column called date and then just store your dates in there.

    Consider this

    Quick. Off the top of your head come up with 10 ways that you could format a date.

    So that’s the first problem - what should a date even look like? Problem two is that while to you the human this would be the most natural way to do this, the spreadsheet might be displaying it in a way that makes sense to you but is actually storing it in a very different format. Additionally, different spreadsheet programs (Microsoft Excel, Google Sheets, LibreOffice, OpenOffice) might be storing and handling dates slightly different from each other. In this case the date functions valid for one might be only somewhat-ish compatible with each other. Additionally, spreadsheet programs generally are trying to automatically recognize dates so e.g. gene/protein names like MAR1, OCT4 would be interpreted to dates and getting the original identifier back might be tricky.

    Additionally spreadsheets sometimes try to be especially helpful by autocompleting information.

    Give it a whirl

    Open the exercise-dates Google Sheets document in the scratch folder of the 01_SharkNurseries folder. Label cell A1 as location and cell B1 as date_sample-1. In cell A2-16 type in A, B, … . Then in cell B2 type a date as just month/day. Hit enter, then click back on the cell and look at the value bar at the top. Describe what you observed and how this “helpful” behavior could lead to data entry problems.

    You can switch between different data formats by customizing the format of the cell.

    Give it a whirl

    In your Google Sheets document type a few dates in cells B3 - B6. Next, highlight column A. and click on the 123 button in the tool bar. Select Custom Date and Time and describe what type of formatting options you have. Pick one, and see how it changes how the content of your cells is displayed.

    For some of the more elaborate formats if you look at the value bar you will notice that even though the content of your cell has changed in terms of how it is formatted, that value might not necessarily match the cell content. How does the spreadsheet program so easily convert between all the different formats?

    Give it a whirl

    Let’s assume that you revisited every site 15 days after the initial visit. In your Google Sheets document type date_sample2 in cell C1. Now in cell C2 type B2 + 15. Describe what happens.

    Wait? Since when can you just add an integer and a date? Aren’t those completely different data formats?

    Give it a whirl

    Highlight column B. Now Click on the 123 button in the toolbar and select Automatic. Describe what you see. Speculate what this means about how spreadsheet programs actually store dates and what implications this could have if you export spreadsheets as text files.

    Google Sheets is actually storing dates as integers from a default day of December 31, 1899. This can be useful, spreadsheet programs were initially developed for and most heavily used for accounting, so the option of being able to in a straightfoward way add days, months, or years to a given date is quite practical.

    However, if you export as a text file you can run into the issue that you text file now has a column with an integer where you expected a date.

    As a side note it also becomes problematic if you are using dates before December 31, 1899 because it cannot parse them correctly.

    It is a lot safer to store dates in spreadsheets not as date but rather in three columns as year, month, day. Another option is to use Julian Day or day of the year. Or you can store as a single string as YYYYMMDD you can do the same for time stamps as YYYYMMDDhhmmss. This has advantages in terms of sorting by assending and descending order and you do not have to worry about converting to text.

    If you were to read this format into R it would initially think that it is an integer, however there are functions that we can use to tell R that it is actually a date and what format it is, and then we would be able to apply functions to extract year, month etc into other programs.

    In sum, treat dates as multiple pieces of data to make them easier to handle downstream.

    5.5 Quality assurance and control

    You will frequently hear people say something along the lines of “oh we still have to QC the data” or “we need to complete QA/QC before we can analyze the data. QA stands for quality assurance and QC stands for quality control and both processes are critical to ensure that data being used moving forward is accurate, reliable and valid.

    Quality assurance focuses on preventing errors and ensuring that the proceses used to generate and enter the data are effective and efficient and minimize error. It involves establishing guidelines, standards, and best practices to be followed during the processes. The goal is to identify and address potential issues before they can occur or at least as early as possible in the process.

    By contrast, quality control focuses on identifying errors that may have occured during the process of generating and entering data by performing checks and tests at various stages of the process to verify that the final output meets the predefined quality standards.

    Ensuring that we have accurate and consistent data collection methods, checking for and removing or correcting data entry errors, and validating data against predefined criteria is a critical step in (data) science. It is important that you keep a good record of the steps you took, rules you apply to discern “good” vs “bad” errors, and which data was removed to ensure transparency and repeatability.

    One important component of quality assurance is stopping from bad data being entered in the first place by creating a list of valid values which will then prohibit false values from being entered. For example, we might be working with different types of gear to catch sharks at each location, longlines, gillnets, and hook-and-line. It would be easy to accidentally mistype one of these gear types or forget whether we are entering everything lowercase or using capitalization.

    Give it a whirl

    In your Google Sheet in column D type gear. Then place your cursor in cell D2 and navigate to Data > Data Validation which will bring up a dialog on the right sight of your screen. Click on Add rule.

    In the Apply to range box it will currently say Sheet1!D2, you can extend this to include all cells from D2 to D6 by modifying this to Sheet1!D2:D6 (or by marking the area in the spreadsheet). Click on the Criteria Dropdown menu to see how many different types of options you have to set rules in terms of what is allowed to be entered into the cells to which you are applying this rule.

    We are going to use a Dropdown. By default you will have two fields. Fill those out as longline and gillnet. Then click on add another item and add hook-and-line.

    Check out the advance options which allows you to change whether you just get a warning or the input is rejected if your entry is invalid, you can also play with the display style to see how that affects the formatting and ease of use. Then click Done.

    We have a short list of options so you you can easily see all three and select the correct one. For longer lists it is more helpful that you can start typing in you data and then select it.

    You can see how this option is helpful for categorical data where typos are an issue. But you can also restrict dates to certain time periods or numbers to certain values.

    Using these types of rules help minimize errors, however it is almost inevitable that something will sneak in eventually which is where quality control comes in.

    Be mindful

    Remember, before you implement any quality control measure you want to make sure that you make a copy of your data and save the original data as your raw, unaltered data set. You will want to make sure that the file name reflects that it is your raw data.

    Create a separate file that you will then clean, make sure that your filename includes some sort of versioning and/or a date so you have a good record of when you processed a data set. Then you want to make sure that your data are all values and not formulas which refer to specific cells. Once you start moving cells around this can screw with your data.

    You will also want to create a text-file (a typical filename would be README) that keeps track of all your files and manipulations so that future you or a collaborator can easily understand and replicate any steps that you take.

    The goal of QC is to find erroneous data. This means that it is generally going to stick out from the rest of the values in a specific column9.

  • 9 Errors are not the same as outliers. Sometimes you know that e.g. certain values cannot be true, for example if all your sample locations where in the northern hemisphere then you cannot have latitudes from the southern hemisphere.

  • We can generally make the assumption that the vast majority of your data is correct. This means that if we sort the values in a column if there are a few errors they will stick out and in many cases they will sort at the very top or very bottom. For example, if your column is numeric anything that is a character will pop out or if you have null values or empty cells they will generally sort to the bottom of a column.

    Be mindful

    Any time you are going to sort data, make sure that you are sorting the entire dataset not just a single column or you will corrupt your data set and everything will end up scrambled.

    Generally, if you don’t have any empty columns or too much missing data if you place the cursor in a cell with a value you can use the shortcut Ctrl/Cmd + A to select all.

    Always double check that you have expanded your sort to the entire data set

    Give it a whirl

    In the scratch folder, open up the catchdata_messy Google Sheet. Then sort each column and see which errors you can spot.

    Make the entire data set is highlighted. Then go select Data > Sort range > Advanced sorting options. Make sure you check the Data has header row option. Then use the sort by drop down menu to select the column you want to sort. Once you are ready, cleock sort.

    Then inspect your column to determine if there are unexpected values and describe your observations. Discuss what you will do with you findings - consider whether it is better (more ethical/responsible) to remove them or correct them.

    Similarly, we can use conditional formatting which allows you to apply specific rules for automatically color coding to a column based on specific rules. This makes it easier for unusual entries or entries outside the possible boundaries to stand out.

    Give it a whirl

    In the catchdata_messy Google Sheet, highlight the STL column. The select Format > Conditional formatting from the toolbar. This will pull up a dialog on the right hand of your window.

    Similar to the Data Validation dialog, you can select the range you want to apply this rule to either by typing it in or selecting it directly in the spreadsheet.

    Click on the Format rules dropdown menus and look through the available options. Let’s say that we know that the sharks cannot be smaller than 50cm or larger than 2m. Set up rules for conditional highlighting that will allow you to quickly pull out invalid entries.

    Click Done once you are all set and evaluate your results.

    Especially for smaller data sets being able to quickly scan for errors can be really helpful, however, down the line we will also learn how to use similar principles to identify errors using R.

    Consider this

    Argue the pros and cons of doing QA/QC directly in the spreadsheet compared to using a command-line program like R.

    5.6 Exporting data

    Generally, want to make sure that we are storing our data in a universally accessible, open, and static format rather than e.g. the default Excel file format (*.xls or *.xlsx).

    • Excel files have a proprietary format and it is possible that in the future technology will change and you will no longer be able to access your files.
    • other program may not be able to read Excel formatted files.
    • different version of Excel may handle data differently which can lead to inconsistencies.
    • frequently journals or grant agencies require you to deposit your data in a data repository that only accepts certain formats which may not include Excel.
    Consider this

    Discuss whether you think Google Sheets has the same problems or if it is an acceptable format to avoid these issues.

    Text-based formats such as comma-delimited (*.csv) or tab-delimited (*.txt or *.txv) files overcome these issues. In CSV files, columnes are separated by commas and in tab-delimited files by tabs10. The advantage of text files is that they can be opened in any plain text editors11 but you can also import them into spreadsheet programs or command-line programs like R.

  • 10 This will look like whitespace if you look at it in a text editor, but tabs, but using whitespace can cause issues when command-line files parse them, tabs are less ambiguous

  • 11 Your operating system will have a built in plain text editors such as notepad. However, you are regularly operating with textfiles it can be helpful to have a more powerful program like Notepad++ or Atom.

  • Give it a whirl

    Open the Excel spreadsheet longline_catchdata.xlsx in the data folder of the project directory you downloaded.

    Select File > Save as from the ribbon, then select Comma Separated Values (*.csv) from the list. Double check the file location and name then click Save.

    Now, repeat the same process to export as a tab-delimited file. You will have multiple options to export as text file, make sure that it says tab-delimited.

    Open both files in a text editor and compare them. If you double click on a *.csv file your computer will typicall open it in excel, you may need to right click and then select open with to open in a text editor.

    You will find that I habitually use tab-delimited files because in Germany we use a , instead of a . for our decimals which means that data values can include commas and therefore exporting as *.csv files can cause a bit of chaos. However, as you pick a data set for your course project you will more likely end up with *.csv files.

    Tip

    Google Sheets now make it a lot easier to export and download copies of spreadsheets in different formats including *.csv by selecting File > Download from the main toolbar and the choosing comma-delimited or tab-delimited from the drop down menu.

    5.7 Acknowledgments

    This chapter is adapted from data carpentries “Data Organization in Spreadsheets for Ecologists lesson.